Fitting a Gradient Boosting Machine (GBM) and publishing to AzureML using R

Introduction

In this notebook you fit a Gradient Boosting Machine (GBM) model using R, and then publish the model as a web service on the Azure Machine Learning platform.

Target audience

You should have some experience with R and know a little about Azure ML web services.

Why GBM in R notebooks

GBM is well known among data scientists and, as a Kaggle profile explains, it has several major advantages compared with traditional statistical models like linear regression:

  • it automatically approximates non-linear transformations and interactions
  • it handles missing values without requiring you to impute them or drop observations
  • monotonic transformation of features won't influence the model's performance

For users who are used to fitting GBM models in Azure ML Experiments, a major advantage of Azure ML notebooks is that they offer more modeling options. For example, when the response variable is continuous you can use the "Boosted Decision Tree Regression" module in Experiments to fit a GBM model. This module, however, does not let you specify the loss function (for statisticians, this means you can't specify the distribution of the response variable). With the gbm package in R, by contrast, you can choose from a wide variety of loss functions.

Data

In this example, you use the housing data from the R package MASS. There are 506 rows and 14 columns in the dataset. Available information includes median home value, average number of rooms per dwelling, crime rate by town, etc. You can find more information about this dataset by typing help(Boston) or ?Boston at the R prompt, or at this UCI page.


In [1]:
library(MASS) # to use the Boston dataset
?Boston


Out[1]:
No documentation for 'Boston' in specified packages and libraries: you could try '??Boston'

GBM model

Estimate hyperparameters

A GBM model has several hyperparameters, and we need to estimate them first. One way to estimate these parameters is cross validation over a parameter grid. In this example, we'll optimize the following parameters over a grid: the number of trees, the maximum tree depth (interaction.depth), the minimum number of observations in a terminal node (n.minobsinnode), and the learning rate (shrinkage). We start by providing several candidate values for each parameter and form the set of all combinations, each combination consisting of one value per parameter. For each combination we then estimate performance by cross validation, using root mean squared error (RMSE) as the performance metric. The "caret" package can be used for this process.
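The grid-search procedure described above can be sketched with caret as follows. This is a minimal sketch rather than the notebook's original tuning run: the grid values, fold count, and seed are illustrative assumptions, and a grid of this size can take a while to train.

```r
# Grid-search cross validation for GBM hyperparameters with caret.
# The candidate values and 5-fold CV below are illustrative choices.
library(caret)
library(gbm)
library(MASS)   # for the Boston data

set.seed(123)

# one row per parameter combination
grid <- expand.grid(n.trees = c(1000, 5000),
                    interaction.depth = c(2, 4),
                    n.minobsinnode = c(1, 10),
                    shrinkage = c(0.001, 0.01))

ctrl <- trainControl(method = "cv", number = 5)

fit <- train(medv ~ ., data = Boston,
             method = "gbm",
             distribution = "gaussian",
             tuneGrid = grid,
             trControl = ctrl,
             metric = "RMSE",
             verbose = FALSE)

fit$bestTune   # parameter combination with the lowest cross-validated RMSE
```

caret refits the model on the full data with the best combination, so `fit` can also be used directly for prediction.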


In [2]:
# load the libraries
if(!require("gbm")) install.packages("gbm")
library(gbm)


Warning message:
: package 'MASS' was built under R version 3.2.3
Warning message:
: package 'gbm' was built under R version 3.2.3
Loading required package: survival
Loading required package: splines
Loading required package: parallel
Loaded gbm 2.1.1

In [3]:
model1 <- gbm(medv ~ ., data = Boston, 
            distribution = "gaussian",
            n.trees = 5000,
            interaction.depth = 2, 
            n.minobsinnode = 1, 
            shrinkage = 0.001)

In [4]:
# summarize the model
options(repr.plot.width = 4, repr.plot.height = 4)
summary(model1)


Out[4]:
var        rel.inf
lstat    42.40815
rm       37.56727
dis       7.985178
crim      4.02818
nox       3.488242
ptratio   2.775456
tax       0.8367165
age       0.3748081
chas      0.1964425
black     0.1819871
indus     0.1090909
rad       0.04847584
zn        0

In [5]:
# plot the marginal effect of the first predictor
plot(model1)


Fit Model with Estimated Parameters

With the selected parameter values from above, we can fit a GBM model.


In [6]:
# fit the model

model2 <- gbm(medv ~ ., data = Boston, 
            distribution = "gaussian",
            n.trees = 10000,
            interaction.depth = 4, 
            n.minobsinnode = 1, 
            shrinkage = 0.01)

summary(model2)


Out[6]:
var        rel.inf
lstat    37.72131
rm       32.2404
dis       9.14701
crim      5.212688
nox       3.867319
ptratio   3.336715
age       2.475507
black     2.08439
tax       1.884501
indus     0.8155508
rad       0.7941646
zn        0.211578
chas      0.2088622

For the fitted model, we can look closely at how the number of trees affects the loss function on training and validation data, and select the best value.
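The gbm package provides gbm.perf() for exactly this: it plots the loss against the number of trees and returns the estimated best iteration. As a sketch, applied to model2 from the cell above: since model2 was fit without cv.folds or a held-out train.fraction, the out-of-bag (OOB) estimate is the available option here (it relies on the default bag.fraction = 0.5, and the package warns it tends to be conservative).

```r
# Estimate the optimal number of trees for model2.
# method = "OOB" uses the out-of-bag improvement estimate, available
# because model2 was fit with the default bag.fraction = 0.5.
best_iter <- gbm.perf(model2, method = "OOB", plot.it = TRUE)
best_iter
```

The returned value can then be passed as n.trees when predicting, instead of a hand-picked number.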

Web service

Deploy a web service

With the developed model, we can deploy a web service so that others can use it to make predictions. The "AzureML" package will be used for this purpose.


In [7]:
# load the library
library(AzureML)

# workspace information
ws <- workspace()

# define predict function
predict_gbm <- function(newdata){
  require(gbm)
  predict(model2, newdata, n.trees = 1000)
}

# test the prediction function
newdata <- Boston[1:10, ]
pred <- predict_gbm(newdata)

data.frame(actual = newdata$medv, prediction = pred)


Warning message:
: package 'AzureML' was built under R version 3.2.3
Out[7]:
   actual prediction
1    24.0   25.92935
2    21.6   21.88954
3    34.7   34.35476
4    33.4   34.62983
5    36.2   33.63816
6    28.7   26.69951
7    22.9   21.37308
8    27.1   20.73458
9    16.5   16.23034
10   18.9   18.36818

In [8]:
# Publish the service
ep <- publishWebService(ws = ws, fun = predict_gbm, 
                        name = "HousePricePredictionGBM", 
                        inputSchema = newdata)
str(ep)


Classes 'Endpoint' and 'data.frame':	1 obs. of  14 variables:
 $ Name                 : chr "default"
 $ Description          : chr ""
 $ CreationTime         : chr "2016-03-13T16:19:55.06Z"
 $ WorkspaceId          : chr "a2aba0dafad8436788401bbc8c22fe36"
 $ WebServiceId         : chr "5d78b508e93711e5a09b9be0b5519e78"
 $ HelpLocation         : chr "https://studio.azureml-int.net/apihelp/workspaces/a2aba0dafad8436788401bbc8c22fe36/webservices/5d78b508e93711e5a09b9be0b5519e78"| __truncated__
 $ PrimaryKey           : chr "G0fn3BdxtiTSanKT5FaAg7YhCGJTz4C2Lqxxfx1d5R9QS3u1ecNLzXpJujvdaopFWhVJIUKe/9EOJviJRVO7aQ=="
 $ SecondaryKey         : chr "fr6iGQBRx0C1AWuhOKDzAo31boQorwc3oKaDjDeILU+OmYI+BJ8woGU0T8erb+4i9s+2tr9kY22CsvR+Ep/+qQ=="
 $ ApiLocation          : chr "https://ussouthcentral.services.azureml-int.net/workspaces/a2aba0dafad8436788401bbc8c22fe36/services/c95a1777e055426d97315a5e20"| __truncated__
 $ PreventUpdate        : logi FALSE
 $ GlobalParameters     :List of 1
  ..$ : list()
 $ MaxConcurrentCalls   : int 4
 $ DiagnosticsTraceLevel: chr "None"
 $ ThrottleLevel        : chr "Low"

Consume a web service

With information about the workspace and the service ID, we can consume the web service with the following code.
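If the ep object from publishWebService is no longer in memory, the endpoint can be looked up again from the workspace. This is a sketch, assuming the AzureML package's services() and endpoints() helpers and the service name used at publish time:

```r
library(AzureML)

# reconnect to the workspace
ws <- workspace()

# find the web service by the name given to publishWebService()
svc <- services(ws, name = "HousePricePredictionGBM")

# retrieve its endpoint; the result can be passed to consume() as before
ep <- endpoints(ws, svc)
```

This lookup requires network access to the Azure ML workspace, so it only works with valid workspace credentials.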


In [9]:
pred <- consume(ep, newdata)$ans
data.frame(actual = newdata$medv, prediction = pred)


Out[9]:
   actual prediction
1    24.0   25.92935
2    21.6   21.88954
3    34.7   34.35476
4    33.4   34.62983
5    36.2   33.63816
6    28.7   26.69951
7    22.9   21.37308
8    27.1   20.73458
9    16.5   16.23034
10   18.9   18.36818

Conclusion

Using the Boston housing dataset, we started the analysis by estimating the parameters in the GBM model. Then we fitted the model and examined variable importance. A web service was also deployed based on the selected model.

In addition to the Gaussian distribution, which uses a squared-error loss function, the gbm package allows several other distributions: "laplace", which uses absolute loss; "tdist", which uses t-distribution loss; and others.
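As a small illustration of switching the loss function, the model above can be refit with absolute-error loss by changing only the distribution argument. This is a sketch, using fewer trees than model2 to keep it quick:

```r
library(gbm)
library(MASS)

# same formula and data as model2, but with Laplace (absolute-error) loss
model_laplace <- gbm(medv ~ ., data = Boston,
                     distribution = "laplace",
                     n.trees = 1000,
                     interaction.depth = 4,
                     n.minobsinnode = 1,
                     shrinkage = 0.01)

# predictions on the training data, using all 1000 trees
head(predict(model_laplace, Boston, n.trees = 1000))
```

Because Laplace loss optimizes the conditional median rather than the mean, it is less sensitive to outlying home values.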

The caret package makes it possible to easily tune the hyperparameters on a grid.


Created by a Microsoft Employee.
Copyright (C) Microsoft. All Rights Reserved.